198 research outputs found

    Shortest prefix strings containing all subset permutations

    Abstract: What is the length of the shortest string consisting of elements of {1,…,n} that contains as subsequences all permutations of any k-element subset? Many authors have considered the special case where k=n. We instead consider an incremental variation on this problem first proposed by Koutas and Hu. For a fixed value of n they ask for a string such that for all values of k⩽n, the prefix containing all permutations of any k-element subset as subsequences is as short as possible. The problem can also be viewed as follows: For k=1 one needs n distinct digits to find each of the n possible permutations. In going from k to k+1, one starts with a string containing all k-element permutations as subsequences, and one adds as few digits as possible to the end of the string so that the new string contains all (k+1)-element permutations. We give a new construction that gives shorter strings than the best previous construction. We then prove a weak form of lower bound for the number of digits added in successive suffixes. The lower bound proof leads to a construction that matches the bound exactly. The length of a shortest prefix string is k(n−2)+[(1/3)(k+1)]+3, for k > 2. The lengths for k=1, 2 are n and 2n−1. This proves the natural conjecture that requiring the strings to be prefixes strictly increases the length of the strings required for all but the smallest values of k.
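
The prefix property described in this abstract can be checked by brute force for small n. The sketch below is only an illustration and is not the construction from the paper: the function names and the example string 1231213 are assumptions chosen here for n=3.

```python
from itertools import permutations


def is_subsequence(pattern, text):
    """Return True if pattern occurs in text as a (possibly non-contiguous) subsequence."""
    remaining = iter(text)
    # Each membership test consumes the iterator up to the matched symbol,
    # so the symbols of pattern must appear in order.
    return all(symbol in remaining for symbol in pattern)


def covers_all_k_permutations(text, n, k):
    """True if every ordering of every k-element subset of {1..n} is a subsequence of text."""
    return all(is_subsequence(p, text) for p in permutations(range(1, n + 1), k))


def shortest_covering_prefix(text, n, k):
    """Length of the shortest prefix of text that covers all k-permutations, or None."""
    for length in range(k, len(text) + 1):
        if covers_all_k_permutations(text[:length], n, k):
            return length
    return None


# Hypothetical example string for n = 3 (not taken from the paper).
example = [1, 2, 3, 1, 2, 1, 3]
prefix_lengths = [shortest_covering_prefix(example, 3, k) for k in (1, 2, 3)]
print(prefix_lengths)
```

For this example string the prefixes of length 3 and 5 agree with the values n and 2n−1 that the abstract states for k=1 and k=2.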

    Dynamic dictionary matching with failure functions

    Abstract: Amir and Farach (1991) and Amir et al. (to appear) recently initiated the study of the dynamic dictionary pattern matching problem. The dictionary D contains a set of patterns that can change over time by insertion and deletion of individual patterns. The user may also present a text string and ask to search for all occurrences of any patterns in the text. For the static dictionary problem, Aho and Corasick (1975) gave a strategy based on a failure function automaton that takes O(|D|log|Σ|) time to build a dictionary of size |D| and searches a text T in time O(|T|log|Σ|+tocc), where tocc is the total number of pattern occurrences in the text. Amir et al. (to appear) used an automaton based on suffix trees to solve the dynamic problem. Their method can insert or delete a pattern P in time O(|P|log|D|) and can search a text in time O((|T|+tocc)log|D|). We show that the same bounds can be achieved using a framework based on failure functions. We then show that our approach also allows us to achieve faster search times at the expense of the update times; for constant k, we can achieve linear O(|T|(k+log|Σ|)+k tocc) search time with an update time of O(k|P||D|^(1/k)). This is advantageous if the search texts are much larger than the dictionary or searches are more frequent than updates. Finally, we show how to build the initial dictionary in O(|D|log|Σ|) time, regardless of what combination of search and update times is used.
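
For orientation, the static failure-function automaton that this abstract attributes to Aho and Corasick can be sketched as follows. This is a minimal textbook-style version for a fixed dictionary, using Python dicts for transitions; it does not implement the dynamic insertions and deletions that the paper is about, and all function names are assumptions.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto transitions, failure links and output lists for a static dictionary."""
    goto = [{}]   # goto[state][char] -> next state; state 0 is the root
    fail = [0]    # fail[state] -> state for the longest proper suffix that is a trie prefix
    out = [[]]    # out[state] -> patterns that end at this state
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)

    # Breadth-first pass: failure links of shallower states are final
    # before deeper states are processed.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit patterns that end at the failure target.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def search(text, automaton):
    """Return (start_index, pattern) for every occurrence of a pattern in text."""
    goto, fail, out = automaton
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            hits.append((i - len(pattern) + 1, pattern))
    return hits


automaton = build_automaton(["he", "she", "his", "hers"])
hits = search("ushers", automaton)
print(hits)
```

The search loop follows failure links only when no transition exists, which is what keeps the scan linear in the text length plus the number of reported occurrences for a fixed alphabet.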

    Evaluating annotations of an Agilent expression chip suggests that many features cannot be interpreted

    Abstract. Background: While attempting to reanalyze published data from Agilent 4 × 44 human expression chips, we found that some of the 60-mer oligonucleotide features could not be interpreted as representing single human genes. For example, some of the oligonucleotides align with the transcripts of more than one gene. We decided to check the annotations for all autosomes and the X chromosome systematically using bioinformatics methods. Results: Out of 42683 reporters, we found that 25505 (60%) passed all our tests and are considered "fully valid". 9964 (23%) reporters did not have a meaningful identifier, mapped to the wrong chromosome, or did not pass basic alignment tests, preventing us from correlating the expression values of these reporters with a unique annotated human gene. The remaining 7214 (17%) reporters could be associated with either a unique gene or a unique intergenic location, but could not be mapped to a transcript in RefSeq. The 7214 reporters are further partitioned into three different levels of validity. Conclusion: Expression array studies should evaluate the annotations of reporters and remove those reporters that have suspect annotations. This evaluation can be done systematically and semi-automatically, but one must recognize that data sources are frequently updated, leading to slightly changing validation results over time.
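
The reporter counts quoted in this abstract can be checked with a few lines of arithmetic. The counts below come from the abstract; the group labels other than "fully valid" are shorthand assumed here, not the authors' terms.

```python
# Reporter counts as stated in the abstract.
total_reporters = 42683
groups = {
    "fully valid": 25505,
    "not interpretable": 9964,    # label assumed for illustration
    "no RefSeq transcript": 7214,  # label assumed for illustration
}

# The three groups should partition the full set of reporters.
counts_add_up = sum(groups.values()) == total_reporters

# Percentage of reporters in each group, rounded to whole percent.
shares = {name: round(100 * count / total_reporters) for name, count in groups.items()}
print(counts_add_up, shares)
```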

    Application of dissociation curve analysis to radiation hybrid panel marker scoring: generation of a map of river buffalo (B. bubalis) chromosome 20

    Abstract. Background: Fluorescence of dyes bound to double-stranded PCR products has been utilized extensively in various real-time quantitative PCR applications, including post-amplification dissociation curve analysis, or differentiation of amplicon length or sequence composition. Despite the current era of whole-genome sequencing, mapping tools such as radiation hybrid DNA panels remain useful aids for sequence assembly, focused resequencing efforts, and for building physical maps of species that have not yet been sequenced. For placement of specific, individual genes or markers on a map, low-throughput methods remain commonplace. Typically, PCR amplification of DNA from each panel cell line is followed by gel electrophoresis and scoring of each clone for the presence or absence of PCR product. To improve sensitivity and efficiency of radiation hybrid panel analysis in comparison to gel-based methods, we adapted fluorescence-based real-time PCR and dissociation curve analysis for use as a novel scoring method. Results: As proof of principle for this dissociation curve method, we generated new maps of river buffalo (Bubalus bubalis) chromosome 20 by both dissociation curve analysis and conventional marker scoring. We also obtained sequence data to augment dissociation curve results. Few genes have been previously mapped to buffalo chromosome 20, and sequence detail is limited, so 65 markers were screened from the orthologous chromosome of domestic cattle. Thirty bovine markers (46%) were suitable as cross-species markers for dissociation curve analysis in the buffalo radiation hybrid panel under a standard protocol, compared to 25 markers suitable for conventional typing. Computational analysis placed 27 markers on a chromosome map generated by the new method, while the gel-based approach produced only 20 mapped markers. Among 19 markers common to both maps, the marker order on the map was maintained perfectly. Conclusion: Dissociation curve analysis is reliable and efficient for radiation hybrid panel scoring, and is more sensitive and robust than conventional gel-based typing methods. Several markers could be scored only by the new method, and ambiguous scores were reduced. PCR-based dissociation curve analysis decreases both time and resources needed for construction of radiation hybrid panel marker maps and represents a significant improvement over gel-based methods in any species.

    Distinct Genetic Alterations in Colorectal Cancer

    Colon cancer (CRC) development often includes chromosomal instability (CIN) leading to amplifications and deletions of large DNA segments. Epidemiological, clinical, and cytogenetic studies showed that there are considerable differences between CRC tumors from African Americans (AAs) and Caucasian patients. In this study, we determined genomic copy number aberrations in sporadic CRC tumors from AAs, in order to investigate possible explanations for the observed disparities. We applied genome-wide array comparative genome hybridization (aCGH) using a 105k chip to identify copy number aberrations in samples from 15 AAs. In addition, we did a population comparative analysis with aCGH data in Caucasians as well as with a widely publicized list of colon cancer genes (CAN genes). There was an average of 20 aberrations per patient with more amplifications than deletions. Analysis of DNA copy number of frequently altered chromosomes revealed that deletions occurred primarily in chromosomes 4, 8 and 18. Chromosomal duplications occurred in more than 50% of cases on chromosomes 7, 8, 13, 20 and X. The CIN profile showed some differences when compared to Caucasian alterations. […] as the most frequently amplified genes. The observed CIN may play a distinctive role in CRC in AAs.

    Targeted genomic analysis reveals widespread autoimmune disease association with regulatory variants in the TNF superfamily cytokine signalling network.

    BACKGROUND: Tumour necrosis factor (TNF) superfamily cytokines and their receptors regulate diverse immune system functions through a common set of signalling pathways. Genetic variants in and expression of individual TNF superfamily cytokines, receptors and signalling proteins have been associated with autoimmune and inflammatory diseases, but their interconnected biology has been largely unexplored. METHODS: We took a hypothesis-driven approach using available genome-wide datasets to identify genetic variants regulating gene expression in the TNF superfamily cytokine signalling network and the association of these variants with autoimmune and autoinflammatory disease. Using paired gene expression and genetic data, we identified genetic variants associated with gene expression, expression quantitative trait loci (eQTLs), in four peripheral blood cell subsets. We then examined whether eQTLs were dependent on gene expression level or the presence of active enhancer chromatin marks. Using these eQTLs as genetic markers of the TNF superfamily signalling network, we performed targeted gene set association analysis in eight autoimmune and autoinflammatory disease genome-wide association studies. RESULTS: Comparison of TNF superfamily network gene expression and regulatory variants across four leucocyte subsets revealed patterns that differed between cell types. eQTLs for genes in this network were not dependent on absolute gene expression levels and were not enriched for chromatin marks of active enhancers. By examining autoimmune disease risk variants among our eQTLs, we found that risk alleles can be associated with either increased or decreased expression of co-stimulatory TNF superfamily cytokines, receptors or downstream signalling molecules. 
Gene set disease association analysis revealed that eQTLs for genes in the TNF superfamily pathway were associated with six of the eight autoimmune and autoinflammatory diseases examined, demonstrating associations beyond single genome-wide significant hits. CONCLUSIONS: This systematic analysis of the influence of regulatory genetic variants in the TNF superfamily network reveals widespread and diverse roles for these cytokines in susceptibility to a number of immune-mediated diseases. The Intramural Research Program of the National Institute of Arthritis and Musculoskeletal and Skin Diseases and the National Library of Medicine of the US National Institutes of Health (Intramural Research Program), Wellcome Trust (080327/Z/06/Z, 087007/Z/08/Z, 094227/Z/10/Z, Clinical PhD Programme, 079895, 076113 and 085475), Medical Research Council (G0400929), National Institute for Health Research, National Institutes of Health (Oxford-Cambridge Scholars Program), Istanbul University Research Fund and UK Behcet’s Syndrome Society. This is the final version of the article. It first appeared from BioMed Central via http://dx.doi.org/10.1186/s13073-016-0329-

    Chromosomal Alterations and Gene Expression Changes Associated with the Progression of Leukoplakia to Advanced Gingivobuccal Cancer

    We present an integrative genome-wide analysis that can be used to predict the risk of progression from leukoplakia to oral squamous cell carcinoma (OSCC) arising in the gingivobuccal complex (GBC). We find that the genomic and transcriptomic profiles of leukoplakia resemble those observed in later stages of OSCC and that several changes are associated with this progression, including amplification of 8q24.3, deletion of 8p23.2, and dysregulation of DERL3, EIF5A2, ECT2, HOXC9, HOXC13, MAL, MFAP5 and NELL2. Comparing copy number profiles of primary tumors with and without lymph-node metastasis, we identify alterations associated with metastasis, including amplifications of 3p26.3, 8q24.21, 11q22.1, 11q22.3 and deletion of 8p23.2. Integrative analysis reveals several biomarkers that have never or rarely been reported in previous OSCC studies, including amplifications of 1p36.33 (attributable to MXRA8), 3q26.31 (EIF5A2), 9p24.1 (CD274), and 12q13.2 (HOXC9 and HOXC13). Additionally, we find that amplifications of 1p36.33 and 11q22.1 are strongly correlated with poor clinical outcome. Overall, our findings delineate genomic changes that can be used in treatment management for patients with potentially malignant leukoplakia and OSCC patients with higher risk of lymph-node metastasis

    PSI-BLAST pseudocounts and the minimum description length principle

    Position specific score matrices (PSSMs) are derived from multiple sequence alignments to aid in the recognition of distant protein sequence relationships. The PSI-BLAST protein database search program derives the column scores of its PSSMs with the aid of pseudocounts, added to the observed amino acid counts in a multiple alignment column. In the absence of theory, the number of pseudocounts used has been a completely empirical parameter. This article argues that the minimum description length principle can motivate the choice of this parameter. Specifically, for realistic alignments, the principle supports the practice of using a number of pseudocounts essentially independent of alignment size. However, it also implies that more highly conserved columns should use fewer pseudocounts, increasing the inter-column contrast of the implied PSSMs. A new method for calculating pseudocounts that significantly improves PSI-BLAST's retrieval accuracy is now employed by default.
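
To make the role of pseudocounts concrete, the sketch below shows a generic way of mixing observed column counts with a prior before taking log-odds scores. It is a simplified illustration, not PSI-BLAST's actual procedure or the article's new method: the prior is taken to be the background frequencies, the alphabet has four toy letters, and the function and parameter names are assumptions.

```python
import math


def column_scores(counts, background, pseudocount_weight):
    """Log-odds scores (in bits) for one alignment column.

    counts: observed residue counts in the column
    background: background frequency of each residue (sums to 1)
    pseudocount_weight: total number of pseudocounts spread according to background
    """
    observed_total = sum(counts.values())
    scores = {}
    for residue, p in background.items():
        # Estimated frequency: observed counts plus the residue's share of pseudocounts.
        q = (counts.get(residue, 0) + pseudocount_weight * p) / (
            observed_total + pseudocount_weight
        )
        scores[residue] = math.log2(q / p)
    return scores


# Toy four-letter alphabet with uniform background (illustrative values only).
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
conserved_column = {"A": 4}

many = column_scores(conserved_column, background, pseudocount_weight=4)
few = column_scores(conserved_column, background, pseudocount_weight=1)
print(many["A"], few["A"])
```

With fewer pseudocounts the conserved residue scores higher and the unobserved residues score lower, which is the sense in which reducing pseudocounts increases contrast for a conserved column.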

    Retrieval accuracy, statistical significance and compositional similarity in protein sequence database searches

    Protein sequence database search programs may be evaluated both for their retrieval accuracy—the ability to separate meaningful from chance similarities—and for the accuracy of their statistical assessments of reported alignments. However, methods for improving statistical accuracy can degrade retrieval accuracy by discarding compositional evidence of sequence relatedness. This evidence may be preserved by combining essentially independent measures of alignment and compositional similarity into a unified measure of sequence similarity. A version of the BLAST protein database search program, modified to employ this new measure, outperforms the baseline program in both retrieval and statistical accuracy on ASTRAL, a SCOP-based test set